ABSTRACT

Objective To research computational methods for 
discovering body site and severity modifiers in clinical 
texts. 

Methods We cast the task of discovering body site and 
severity modifiers as a relation extraction problem in the 
context of a supervised machine learning framework. We 
utilize rich linguistic features to represent the pairs of 
relation arguments and delegate the decision about the 
nature of the relationship between them to a support 
vector machine model. We evaluate our models using 
two corpora that annotate body site and severity 
modifiers. We also compare the model performance to a 
number of rule-based baselines. We conduct cross- 
domain portability experiments. In addition, we carry out 
feature ablation experiments to determine the 
contribution of various feature groups. Finally, we 
perform error analysis and report the sources of errors. 
Results The performance of our method for discovering 
body site modifiers achieves F1 of 0.740-0.908 and our 
method for discovering severity modifiers achieves F1 of 
0.905-0.929. 

Discussion Results indicate that both methods perform 
well on both in-domain and out-of-domain data, 
approaching the performance of human annotators. 
The most salient features are token and named entity 
features, although syntactic dependency features also 
contribute to the overall performance. The dominant 
sources of errors are infrequent patterns in the data and 
inability of the system to discern deeper semantic 
structures. 

Conclusions We investigated computational methods 
for discovering body site and severity modifiers in clinical 
texts. Our best system is released open source as part of 
the clinical Text Analysis and Knowledge Extraction 
System (cTAKES). 



BACKGROUND AND SIGNIFICANCE 

It is widely accepted that the clinical narrative 
within electronic health records contains a substan- 
tial part of the patient's health information, but in 
its raw form does not represent computable data 
structures suitable for biomedical applications. 
Increasingly over the last decade the field of clinical 
natural language processing (NLP) has focused on 
developing methods for the semantic processing of 
clinical text that are use case and disease agnostic, 
and can thus be incorporated into a variety of clin- 
ical applications. The clinical NLP community has 
been converging around the use of conventions and 
standards for semantic processing to foster intra 
and inter-operability such as the unified medical 
language system 1 (UMLS), 2 Penn Treebank, 3 
PropBank, 4 TimeML 5 and Health Level 7. This 
shift from use case-specific applications to more 
general purpose and standards-based tools is 
characteristic of the last few years of clinical NLP 
efforts, especially within the environment of 
Meaningful Use Stage 2. 6 

The transformation of free text into a structured 
computable representation model is known as 
Information Extraction. 7 In the general NLP 
domain, such representation models have been 
defined by the NIST-sponsored Automatic Content 
Extraction (ACE) 8 and Text Analysis Conference 
(TAC) 9 shared tasks, which included templates for 
person and organization and template slots such as 
employee_of and city_of_residence. However, 
these representations are of little relevance to the 
clinical domain. Instead, representations such as the 
Consolidated Clinical Document Architecture 
(CCDA) for Meaningful Use Stage 2, the Clinical 
Element Model 10 (CEM) or the College of 
American Pathologists (CAP) protocols are more 
relevant. CCDA provides clinical and functional 
context for practical implementations of the Health 
Level 7 balloted standards 6 and can be thought of 
as the normalization target for electronic health 
records information. 

Body site and severity modifiers are two of the 
attributes (or template slots) associated with 
healthcare representation models such as CCDA, CEM 
and the CAP protocols. These modifiers are usually 
attached to a disease/disorder, sign/symptom or procedure. 
Consider a sentence from a clinical record of a 
rheumatoid arthritis patient: He still is not able to 
work because of severe pain involving his wrists. In 
this sentence we would like to discover two facts: 
(1) that the body site of pain is the patient's wrists, 
and (2) that the severity level of pain is severe. 

There is earlier work on discovering tumor body 
sites from pathology notes. MedKAT/P 11 employs 
hand-built rules to populate a colon cancer tem- 
plate in which the body location of the primary 
tumor is one of the attributes. caTIES 12 identifies 
all tumor site mentions in pathology reports using 
regular expressions. Martinez and Li 13 explore a 
machine learning methodology for populating a 
colorectal cancer template with six attributes 
including the tumor site. They report an F score of 
58.1, for a model whose most predictive features 
are based on UMLS and SNOMED-CT. Jouhet 
et al 14 apply machine learning techniques to pathology 
notes from the French Poitou-Charentes Cancer Registry 
to discover the primary tumor site automatically and 
map it to International Classification of Diseases for 
Oncology (ICD-O) 15 codes. Kavuluru et al 16 focus on 
extracting the generic ICD-O code for primary cancers 
reported in pathology reports; the body site of interest 
is that of the primary tumor. MedLEE 17 18 'has 
an integrated syntactic and semantic component which is 
realized in the form of its grammar. The MedLEE grammar 
consists of a specification of semantic (and sometimes syntactic) 
components and is used to interpret the semantic properties of 
the individual terms and of their relations with other terms, and 
to generate a target output. The semantic grammar rules were 
developed based on co-occurrence patterns observed in clinical 
text.' MedLEE's scope includes processing radiology notes, 
discharge summaries and clinical reports. 

In this paper we demonstrate that the problem of body site and 
severity modifier discovery can be successfully treated as a relation 
extraction task, a well-established semantic processing task. 
Relation extraction focuses on determining the relationships 
between entities in text. We use the UMLS definitions to type the 
relations and the entities. In our sample sentence, the entities pain 
and wrists are the participants of the LocationOf relation and can 
be succinctly captured as LocationOf(wrists, pain). The relationship 
between the entity pain and the modifier severe can be expressed 
as DegreeOf(pain, severe). The first argument of the LocationOf 
relation is an anatomical site, while the second argument is a sign/ 
symptom, disease/disorder, or procedure. The first argument of the 
DegreeOf relation is either a sign/symptom or a procedure, while 
the second argument is a modifier (eg, significant, severe, marked). 

In general, semantic processing of language aims to capture 
the meaning behind the many surface forms that written lan- 
guage can assume. For example, the relationship we represented 
earlier as LocationOf (wrists, pain) is often also expressed in 
clinical notes as pain in his wrists, pains involving his wrists, 
wrist pain, or his main complaints of joint pain are presently at 
the wrists bilaterally. Because of this diversity of clinical lan- 
guage, a rule-based approach is hard to implement. Instead, we 
adopt a supervised machine learning approach, in which we pair 
up candidate clinical entities and delegate the decision about 
whether they participate in a relation to a supervised classifier. 

Supervised learning has been applied for relation extraction 
in the general domain. Feature-based methods 19 20 represent 
relation instances using carefully engineered sets of features. 
Kernel-based methods 21 22 make it possible to explore large (in 
some cases infinite) feature spaces automatically. In this work, 
we attempt both approaches and demonstrate that the feature- 
based approach is more promising for our task. In the clinical 
domain, relation extraction was the focus of the 2010 Informatics 
for Integrating Biology and the Bedside (i2b2)/VA shared task, 23 
although the targeted relations were very different from ours. 
A recent work 24 applied supervised learning for identifying ana- 
tomical locations of a small number of manually selected action- 
able findings in appendicitis-related radiology reports. Unlike 
their work, we do not limit the input of our system to a set of 
predefined findings; instead our system is potentially capable of 
identifying the anatomical sites for any sign/symptom, disease/ 
disorder, or procedure that exists in UMLS. Open information 
extraction 25-27 offers an alternative to supervised learning via 
the use of lightly supervised methods for extracting relations 
and their arguments from large collections of text. However, 
this work is not directly applicable to our task due to the diffi- 
culty of mapping the open set of relations to our relations of 
interest. 

Our main contributions are: 

1. We design and develop a machine learning system for disco- 
vering intra-sentential body site and severity modifiers from 
the clinical narrative, modeling the problem as a relation 
extraction task. 

2. We conduct feature ablation experiments to determine the 
most salient features for the task. 



Table 1 Description and statistics of the SHARP and ShARe corpora 

Corpus                  SHARP                            ShARe
Type of notes           Radiology, pathology, oncology   ICU notes, discharge summaries
Tokens                  70 704                           104 918
Sentences               4801                             8058
Entity mentions         11 781                           5541
Entity mention pairs    36 865                           6441
LocationOf relations    5025                             2190
DegreeOf relations      729                              702
LocationOf agreement    0.74                             0.80
DegreeOf agreement      0.87                             0.66

ShARe, Shared Annotated Resource; SHARP, Strategic Health Advanced Research Project. 



3. We experiment with tree kernels, which have not been used 
in the past for relation extraction from the clinical narrative. 

4. We demonstrate that our models are highly portable across 
different types of notes. 

5. To allow result replication we make the gold standard 
corpus we used in our experiments available to the research 
community, and release our best-performing methods open 
source as part of the Apache clinical Text Analysis and 
Knowledge Extraction System 28 (cTAKES) 29 allowing repli- 
cation of experiments as well as adoption and improvements 
thus strengthening the clinical NLP ecosystem. 

MATERIALS AND METHODS 
Corpus 

In our experiments, we utilize two annotated corpora that have 
been in development for the past 3 years and that are now made 
available to the community through data use agreements with 
the contributing institution (to initiate the process, contact the 
last author) — the Strategic Health Advanced Research Project: 
area 4 (SHARP) 30 and the shared annotated resource (ShARe). 31 
Table 1 provides the high-level characteristics of the corpora 
and box 1 gives a few example annotations. 

The SHARP corpus provides several layers of annotations — 
syntax and semantics based on Treebank, PropBank and 
UMLS, 32 and normalization targets based on CEM. 33 The 
corpus consists of an equal amount of radiology notes, from 
Mayo Clinic peripheral arterial disease patients, and breast 
cancer oncology and pathology notes, from Seattle Group 
Health. The SHARP corpus is annotated for such clinical 
entities as drugs, diseases/disorders, signs/symptoms, procedures 
and anatomical sites. Diseases/disorders, signs/symptoms and 
procedures have body site modifiers expressed as a relation 
between the anchor and an anatomical site. Diseases/disorders 
and signs/symptoms have a severity modifier, expressed as a 
relation between the anchor and a severity indicator normalized to 
none, slight, moderate or severe. At the time of our experiments, 
the 'seed' part of the SHARP corpus, consisting of 18 batches 
(subsections) and a total of 183 notes, was fully completed 
including double annotation and adjudication. We split this 
corpus into training (140 notes: batches 2-9, 13-16, 18-19), 
development (21 notes: batches 10, 17), and test (22 notes: 
batches 11, 12) sets. 

Box 1 Example annotations 

The [common femoral] had [moderate] [disease] without [stenosis]. 
LocationOf(common femoral, disease); LocationOf(common femoral, stenosis); DegreeOf(disease, moderate) 

The patient had a [[skin] tumor] removed from [behind his left ear]. 
LocationOf(skin, skin tumor); LocationOf(behind his left ear, skin tumor) 

The ShARe corpus consists of MIMIC intensive care unit 
notes and discharge summaries as part of the PhysioNet 
project. 34 It annotates parts of speech (POS) and phrasal chunks 
consistent with the SHARP corpus. Annotated named entities 
are a subset of the SHARP types: anatomical sites and diseases/ 
disorders. The latter have body site and severity modifiers also 
consistent with the SHARP corpus. At the time of our experi- 
ments, the first 13 batches of the ShARe corpus were fully anno- 
tated and adjudicated. We used these 13 batches (130 notes) for 
our experiments. We split the set of notes into a training set (80 
notes), development set (25 notes), and test set (25 notes). The 
full details of the ShARe annotations will be described in a sep- 
arate paper; here we focus only on the relevant relation 
annotations. 

Inter-annotator agreement on these corpora is computed with 
F1 score. Human agreement typically suggests the upper bound 
of system performance but is not necessarily the ceiling. 

Classification task 

We view the problems of body site and severity modifier discovery 
as relation extraction tasks. Formally, we define a relation 
extraction task as: given two sets of entities, E and F, and a 
relation R ⊆ E × F, find all pairs (e, f) ∈ R. Essentially, a relation 
extraction task requires us to search over all pairs of entities in E 
and F, and identify the ones that participate in the relation R. The 
set E will contain entities like symptoms and diseases for both 
the body site (LocationOf) and severity (DegreeOf) relations, 
while the set F will contain anatomical sites for the LocationOf 
relation, and severity expressions for the DegreeOf relation. 

We cast this relation extraction task as a supervised learning 
problem. Given a pair of entities (e, f), we train a classifier to 
decide whether or not (e, f) ∈ R. Thus, the classification task is 
binary and the classifier must assign each pair (e, f) one of the 
classes {R, No-R}. In particular, we focus on a sentence-level 
task, in which the classifier must look at all pairs of entities 
within a sentence, and learn to predict the class R if a relation 
was annotated between those two entities, and the class No-R if 
a relation was not annotated. We train two relation extraction 
classifiers, one for R = LocationOf and one for R = DegreeOf. 
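
The sentence-level pairing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entity` class, the type labels, and the representation of gold relations as a set of text pairs are hypothetical and are not taken from cTAKES or from the corpora.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Entity:
    text: str
    etype: str  # hypothetical type label, e.g. "sign_symptom"
    start: int  # token offset within the sentence


def candidate_pairs(entities, e_types, f_types):
    """Yield every (e, f) pair in one sentence with e in set E and f in set F."""
    es = [ent for ent in entities if ent.etype in e_types]
    fs = [ent for ent in entities if ent.etype in f_types]
    for e, f in product(es, fs):
        if e != f:
            yield e, f


def label_pairs(pairs, gold):
    """Assign class R if the pair was annotated as a relation, else No-R."""
    return [(e, f, "R" if (e.text, f.text) in gold else "No-R") for e, f in pairs]


sentence_entities = [
    Entity("severe", "severity", 9),
    Entity("pain", "sign_symptom", 10),
    Entity("wrists", "anatomical_site", 13),
]
location_pairs = list(
    candidate_pairs(sentence_entities, {"sign_symptom"}, {"anatomical_site"})
)
labeled = label_pairs(location_pairs, gold={("pain", "wrists")})
```

Running the same functions with `f_types={"severity"}` would produce the candidate pairs for the DegreeOf classifier.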

In this paper, we train support vector machine (SVM) 
classifiers for these tasks. SVMs perform well on a variety of NLP 
tasks. 35 



Classifier features 

To train a classifier, we must characterize each (e, f) pair with a 
set of features that provide clues as to whether or not this pair 
of entities participates in the relation R. We utilize rich linguistic 
features including lexical, syntactic, and semantic features. 
Figure 1 illustrates the features. Many of our features are based 
on Zhou et al 1 and the best-performing systems 36 37 from the 
2010 i2b2 challenge. 23 Below, we briefly summarize our features 
and refer the reader to these publications for details: 



► Token: the first and the last word of each entity, all words of 
the entity as a bag, the preceding and the following three 
words, and the number of words between the two entities 

► POS: the POS tags of each entity as a bag 

► Chunking: the head words of the syntactic base phrase 
chunks between the two entities 

► Dependency tree: the governing word and its POS tag for 
each entity's head word 

► Dependency path: the length of the path through the 
dependency tree from each entity to their common ancestor, and 
the path between the two entities as a string 

► Named entity: the number of entities between the two 
entities, UMLS types of both entities, and whether the first 
entity is enclosed in the second one (or vice versa). 
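
As a rough illustration of the token feature group listed above, the sketch below computes it for one entity pair. The function name, the dictionary keys, and the reading that the preceding and following three words are taken around each entity are assumptions made for this example; the actual cTAKES feature extractors may differ.

```python
def token_features(tokens, arg0, arg1, window=3):
    """Token features for one entity pair.

    tokens: list of sentence tokens
    arg0, arg1: (start, end) token offsets of the entities, end exclusive
    """
    feats = {}
    for name, (start, end) in (("arg0", arg0), ("arg1", arg1)):
        words = tokens[start:end]
        feats[name + "_first"] = words[0]
        feats[name + "_last"] = words[-1]
        feats[name + "_bag"] = sorted(set(words))
        feats[name + "_preceding"] = tokens[max(0, start - window):start]
        feats[name + "_following"] = tokens[end:end + window]
    # Words between the two entities, regardless of which comes first.
    (_, left_end), (right_start, _) = sorted([arg0, arg1])
    feats["between"] = tokens[left_end:right_start]
    feats["n_between"] = len(feats["between"])
    return feats


tokens = ("He still is not able to work because of severe pain "
          "involving his wrists").split()
feats = token_features(tokens, arg0=(10, 11), arg1=(13, 14))
```

For the example sentence, `feats["between"]` holds the two words between pain and wrists, matching the "Words between entities" entry in figure 1.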

We also experimented with tree kernel features, which have 
been used successfully for relation extraction and semantic role 
labeling in the general domain. 38 39 Tree kernels offer a general- 
ized approach to representing syntactic features. An instance is 
represented by some phrase structure context, and the similarity 
between two instance structures is computed by taking a 
weighted sum of similar substructures (see Collins and Duffy 40 
for details). In this work, we use a representation called 
path-enclosed tree, 39 which, starting from a complete automatic 
parse of a sentence, represents each potential relation instance 
with the smallest sub-tree in the sentence containing both 
arguments. In addition, new nodes labeled ARG1-{type} and 
ARG2-{type} are inserted into the tree above the lowest node 
that dominates the respective arguments, where {type} 
represents the UMLS semantic type of the argument. All features are 
generated automatically by cTAKES, which includes a POS 
tagger, a UMLS dictionary lookup, a phrase-chunker and the 
dependency parser from Albright et al. 41 

Classifier parameters 

In addition to a set of features, most supervised classifiers have a 
set of parameters that are not set during the learning process, 
and must be separately specified. SVMs have several such 
parameters, including the cost of misclassification (SVM_C), the kernel 
type (SVM_kernel, eg, linear vs radial basis function), and 
additional kernel-specific parameters (eg, SVM_gamma in the radial 
basis function kernel). To address specific issues associated with 
entity-relation data, our models include several additional 
classifier parameters beyond the standard SVM parameters. 

Learning from imbalanced data is a central challenge in training 
relation extraction systems. Recall that we generate training 
instances with classes {R, No-R} for all pairs of entities within 
each sentence. As most entities and modifiers in a sentence are 
unrelated, we typically end up with significantly more negative 
than positive examples. Without additional guidance, most 
classifiers learn to favor the more dominant class. Thus, our models 
include a down-sampling parameter, P_neg, to address this 
imbalance. During training, this parameter is used to randomly 
discard negative (ie, dominant class) examples with probability 
1 - P_neg. 
We also consider, as a classifier parameter, a variation to the 
classification paradigm. Note that in the standard binary classi- 
fier approach described above, if there is any overlap between 
sets E and F we may have to classify two entities e and f twice: 
once for the pair (e, f) and once for the pair (f, e). An alterna- 
tive to this approach is to train a three-way classifier. We first 
order all of the entities by their location in the clinical text, and 
then pair up entities only with other entities that are later in the 
text. This means that we will see only (e, f) or (f, e), but not 
both. Now, we train our classifier to assign each pair (x, y) one 
of three classes: {R, R^-1, No-R}, where the class R indicates that 
the relation R(x, y) is present, the class R^-1 indicates that the 
relation R(y, x) is present, and No-R indicates that there is no 
relation for either ordering of the entities. Thus in our set-up, 
we have a strategy parameter that is set to one of {2-class, 
3-class}. 

Figure 1 Some of the features used to predict the LocationOf relation in an example sentence. 

Example sentence: He still is not able to work because of severe pain involving his wrists. 
Annotated tokens: severe (JJ), pain (NN, sign/symptom), his (PRP$), wrists (NNS, anatomical site) 

Token features for LocationOf: 
Words of arg0: pain 
Words of arg1: wrists 
Words between entities: involving, his 

Dependency features for LocationOf: 
Path: pain-NN-PMOD-VBG-OBJ-NNS-wrists 
Governing word of arg0: because 
Governing word of arg1: involving 

POS tag features for LocationOf: 
POS tags of arg0: NN 
POS tags of arg1: NNS 

Named entity features for LocationOf: 
Number of entities between args: 0 
UMLS Entity type of arg0: sign/symptom 
UMLS Entity type of arg1: anatomical site 
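
The 3-class labeling scheme described above can be sketched as follows. The input format (entity names with token offsets, and gold relations as ordered name pairs) and the label strings are assumptions made for this illustration.

```python
def three_class_instances(entities, gold):
    """Pair each entity only with entities that appear later in the text.

    entities: list of (name, token_offset)
    gold: set of ordered (first_argument, second_argument) name pairs
    Returns (x, y, label) triples with label in {"R", "R^-1", "No-R"}.
    """
    ordered = sorted(entities, key=lambda ent: ent[1])
    instances = []
    for i, (x, _) in enumerate(ordered):
        for y, _ in ordered[i + 1:]:
            if (x, y) in gold:
                label = "R"      # relation R(x, y) holds
            elif (y, x) in gold:
                label = "R^-1"   # relation R(y, x) holds
            else:
                label = "No-R"
            instances.append((x, y, label))
    return instances


entities = [("severe", 9), ("pain", 10), ("wrists", 13)]
instances = three_class_instances(entities, gold={("wrists", "pain")})
```

Because wrists follows pain in the text, the gold relation LocationOf(wrists, pain) surfaces as the inverse class on the pair (pain, wrists), and each unordered pair is classified only once.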

The tree kernel requires the setting of a parameter λ, which 
represents a discount of larger tree structures. In addition, tree 
kernels can be used on their own or incorporated with other 
features in a composite kernel, which takes a weighted linear 
combination of a traditional feature kernel with a tree kernel. 
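
One way to write such a composite kernel is shown below. The symbols K_feat, K_tree and the mixing weight α are notation introduced here for illustration, and the cosine-style normalization is a common choice assumed for this sketch rather than a detail confirmed by the text.

```latex
\hat{K}(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\,K(y, y)}}, \qquad
K_{\mathrm{comp}}(x, y) = \alpha\,\hat{K}_{\mathrm{feat}}(x, y)
  + (1 - \alpha)\,\hat{K}_{\mathrm{tree}}(x, y), \quad 0 \le \alpha \le 1
```

Under this notation, giving both kernels equal weight corresponds to α = 1/2.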

Experimental set-up 

Models are evaluated on these corpora using measures commonly 
employed in NLP, namely precision, recall and F1 
score. 7 42 
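
For reference, these measures can be computed from sets of predicted and gold relations as in the sketch below. The function names and the tuple encoding of a relation are illustrative assumptions.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Score a set of predicted relations against a set of gold relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


gold = {("LocationOf", "wrists", "pain"), ("DegreeOf", "pain", "severe")}
predicted = {("LocationOf", "wrists", "pain"), ("LocationOf", "feet", "pain")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

The same F1 formula is used when one annotator's relations are scored against the other's to obtain inter-annotator agreement.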

To set the various model parameters (SVM_C, SVM_kernel, 
SVM_gamma, P_neg, strategy), models are trained on the training set 
and evaluated on the development set. We explore the space of 
possible parameter settings using a grid search, training one 
model for each set of parameters. The parameter settings for 
the model with the highest F1 on the development set are used 
to train a model on the combination of the training and 
development data. This final model is then evaluated on the testing 
data. Note that this parameter tuning is performed separately 
for the DegreeOf and LocationOf models, so the two models 
may have different parameter settings chosen by their respective 
grid searches. For the tree kernel parameters, we set λ = 0.4, and 
use a composite kernel (combining the grid search-optimized 
feature kernel and the tree kernel), normalizing both kernels 
and giving them equal weight. These tree kernel parameters can 
be optimized using a grid search, but it is computationally quite 
expensive to train tree kernels, so we set the parameters based 
on values found to perform well in previous work. 
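
The tuning loop described above can be sketched generically. In this illustration the grid values are hypothetical, and `toy_dev_f1` is a stand-in for training an SVM on the training set and measuring F1 on the development set; neither is taken from the actual experiments.

```python
from itertools import product


def grid_search(param_grid, train_and_score):
    """Try every parameter combination and keep the one with the highest F1.

    param_grid: dict mapping parameter name to a list of candidate values
    train_and_score: callable taking a params dict and returning development F1
    """
    names = sorted(param_grid)
    best_params, best_f1 = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        f1 = train_and_score(params)
        if f1 > best_f1:
            best_params, best_f1 = params, f1
    return best_params, best_f1


def toy_dev_f1(params):
    # Placeholder scoring function, used only to make the sketch runnable.
    target = {"C": 100.0, "kernel": "rbf", "p_neg": 0.5, "strategy": "3-class"}
    return 0.9 if params == target else 0.5


grid = {
    "C": [0.05, 0.5, 100.0],
    "kernel": ["linear", "rbf"],
    "p_neg": [0.5, 1.0],
    "strategy": ["2-class", "3-class"],
}
best_params, best_f1 = grid_search(grid, toy_dev_f1)
```

After the search, the selected settings would be used to retrain on the combined training and development data before the single evaluation on the test set.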

We implement five rule-based baselines to which we compare 
the performance of our system. The first four baselines only link 
pairs of entities that have appropriate entity types for their 
respective relations (DegreeOf or LocationOf). The first baseline 
predicts relations only in sentences with exactly two entities. 
The second baseline searches for sentences with one or more 
modifiers (anatomical site for LocationOf, severity for 
DegreeOf) and exactly one other entity, and predicts a relation 
between the entity and the closest modifier. The third baseline 
associates each modifier with the nearest entity, as long as there 
is no intervening modifier. The fourth baseline predicts a 
relation only between entities that are enclosed in the same noun 
phrase. The fifth baseline approximates a grammar/rule-based 
system. It trains an SVM model using only the dependency path 
feature (with words on both ends replaced with their UMLS 
semantic types), essentially allowing the SVM to memorize 
dependency paths between clinical entities that are likely 
indicators of LocationOf or DegreeOf relations. We train a model 
using only this feature, tuning the model parameters on the 
development set. 
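
As an illustration, the third baseline might be sketched as below. Measuring nearness in token offsets, breaking ties in favor of the first listed entity, and the input format are assumptions of this example; the entities passed in are assumed to be already filtered to the types appropriate for the relation.

```python
def nearest_entity_baseline(modifiers, entities):
    """Link each modifier to its nearest entity unless another modifier intervenes.

    modifiers, entities: lists of (name, token_offset) within one sentence
    Returns a set of (entity_name, modifier_name) links.
    """
    links = set()
    for mod_name, mod_pos in modifiers:
        if not entities:
            continue
        ent_name, ent_pos = min(entities, key=lambda ent: abs(ent[1] - mod_pos))
        low, high = sorted((mod_pos, ent_pos))
        # Skip the link if some other modifier lies strictly between the two.
        if any(low < pos < high for _, pos in modifiers):
            continue
        links.add((ent_name, mod_name))
    return links


links = nearest_entity_baseline(
    modifiers=[("severe", 9)],
    entities=[("pain", 10)],
)
```

A rule of this kind needs no training data, which makes it a useful reference point for relations whose arguments are usually adjacent.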

RESULTS 
Model tuning 

As described earlier, a grid search over possible parameter 
settings was performed using the training and development data. 
This search determined that for the SHARP corpus, the best 
parameters for the LocationOf classifier were SVM_C=100.0, 
SVM_kernel=radial basis function, SVM_gamma=0.01, P_neg=0.5 
and strategy=3-class; the best parameters for the DegreeOf 
classifier were SVM_C=0.05, SVM_kernel=linear, P_neg=1.0 and 
strategy=3-class. For the ShARe corpus, the best parameters for 
the LocationOf classifier were SVM_C=100.0, SVM_kernel=radial 
basis function, SVM_gamma=0.001, P_neg=1.0 and 
strategy=3-class; the best parameters for the DegreeOf classifier were 
SVM_C=0.5, SVM_kernel=linear, P_neg=1.0 and strategy=3-class. 

So, the 3-class strategy was most effective for all four models, 
and for most models downsampling was not necessary. 

Model evaluation 

In this section we conduct an evaluation on a held-out test set, 
which provides an estimate of the system performance that can 
be achieved in practice. For each corpus, we train the models 
for the LocationOf and DegreeOf relations on the combination 
of the training and development data, using the parameters 
determined in the previous section. We then evaluate the 
models against the test set. We also evaluate the baseline models 
on the same test set. To assess the portability of our models we 
also evaluate the models trained on the SHARP training and 
development sets against the ShARe test set. Results are shown 
in table 2. We do not test a ShARe-trained model on the 
SHARP test set because ShARe annotates only a subset of the 
SHARP entity types (see the Corpus section). So for example, a 
ShARe model will never see a procedure mention in the ShARe 
training data, but would be asked to find relations for procedure 
mentions in the SHARP test set. 

Feature ablation experiments 

To quantify the utility of each feature group, we performed all- 
but-one feature ablation experiments on the development set. 
That is, we left out each feature group, retrained the model, and 
evaluated it on the development set. We report the results for 
the SHARP corpus in table 3. 
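
The all-but-one procedure reduces to comparing each ablated model's F1 with that of the full model. The sketch below applies this to the development-set F1 values reported in table 3; the data layout and function name are illustrative assumptions.

```python
# Development-set F1 on SHARP with all features, as reported in table 3.
ALL_FEATURES_F1 = {"LocationOf": 0.776, "DegreeOf": 0.972}

# F1 after removing one feature group at a time, as reported in table 3.
ABLATED_F1 = {
    "token": {"LocationOf": 0.742, "DegreeOf": 0.909},
    "POS": {"LocationOf": 0.768, "DegreeOf": 0.963},
    "chunking": {"LocationOf": 0.766, "DegreeOf": 0.972},
    "named entity": {"LocationOf": 0.712, "DegreeOf": 0.904},
    "dependency tree": {"LocationOf": 0.757, "DegreeOf": 0.944},
    "dependency path": {"LocationOf": 0.755, "DegreeOf": 0.954},
}


def ablation_deltas(relation):
    """Return (feature_group, delta_F1) pairs, largest performance drop first."""
    deltas = {
        group: round(scores[relation] - ALL_FEATURES_F1[relation], 3)
        for group, scores in ABLATED_F1.items()
    }
    return sorted(deltas.items(), key=lambda item: item[1])


location_ranking = ablation_deltas("LocationOf")
degree_ranking = ablation_deltas("DegreeOf")
```

The more negative the delta, the more the full model depends on that feature group.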

DISCUSSION 

The results of our evaluation indicate that for both LocationOf 
and DegreeOf, model performance is high — typically achieving 
the same level as the human agreement. The performance of the 
DegreeOf model is consistently higher than that of the 
LocationOf model, probably because the task of discovering 
DegreeOf relations is easier: on average, the arguments of a 
DegreeOf relation are 0.61 tokens apart, while the arguments of 
a LocationOf relation are 3.22 tokens apart, and for the 
DegreeOf relation, the classifier had to consider only 2643 
candidate entity pairs (28% of which were true relations), but for 
LocationOf it had to consider 36 865 pairs (14% of which were 
relations). 

Table 2 Model performance on the SHARP and ShARe test sets 

Relation    Test corpus  Model                    Precision  Recall  F1
LocationOf  SHARP        Baseline 1               0.900      0.096   0.174
                         Baseline 2               0.910      0.198   0.325
                         Baseline 3               0.858      0.431   0.574
                         Baseline 4               0.551      0.522   0.536
                         Baseline 5               0.758      0.340   0.470
                         SVM trained on SHARP     0.786      0.699   0.740
                         Composite (TK+features)  0.828      0.661   0.735
                         Human agreement                             0.744
LocationOf  ShARe        Baseline 1               1.000      0.356   0.525
                         Baseline 2               1.000      0.381   0.552
                         Baseline 3               0.971      0.777   0.863
                         Baseline 4               0.521      0.700   0.598
                         Baseline 5               0.941      0.556   0.699
                         SVM trained on ShARe     0.953      0.867   0.908
                         SVM trained on SHARP     0.916      0.883   0.899
                         Human agreement                             0.800
DegreeOf    SHARP        Baseline 1               1.000      0.044   0.084
                         Baseline 2               1.000      0.044   0.084
                         Baseline 3               0.907      0.857   0.881
                         Baseline 4               0.896      0.758   0.821
                         Baseline 5               0.860      0.473   0.610
                         SVM trained on SHARP     0.869      0.945   0.905
                         Composite (TK+features)  0.840      0.923   0.880
                         Human agreement                             0.871
DegreeOf    ShARe        Baseline 1               0.944      0.121   0.214
                         Baseline 2               0.947      0.128   0.225
                         Baseline 3               0.977      0.887   0.929
                         Baseline 4               0.929      0.745   0.827
                         Baseline 5               0.404      0.979   0.571
                         SVM trained on ShARe     0.929      0.929   0.929
                         SVM trained on SHARP     0.926      0.887   0.906
                         Human agreement                             0.664

ShARe, Shared Annotated Resource; SHARP, Strategic Health Advanced Research Project. 

Table 3 Performance of models with various features removed 

Included features            LocationOf F1  LocationOf ΔF1  DegreeOf F1  DegreeOf ΔF1
All                          0.776                          0.972
No token features            0.742          -0.034          0.909        -0.063
No POS features              0.768          -0.008          0.963        -0.009
No chunking features         0.766          -0.010          0.972        0
No named entity features     0.712          -0.064          0.904        -0.068
No dependency tree features  0.757          -0.019          0.944        -0.028
No dependency path features  0.755          -0.021          0.954        -0.018

SHARP, Strategic Health Advanced Research Project. 

On the SHARP corpus, the SVM models outperformed all 
five rule-based baselines. On the ShARe corpus, the SVM 
LocationOf model outperformed all five baselines, but for 
DegreeOf, baseline 3 performed as well as the SVM. Baseline 3, 
which added relations for adjacent modifiers and entities, 
generally had good performance on DegreeOf, in which the 
arguments were on average only 0.61 tokens apart. However, for 
LocationOf, the baseline did not perform as well as the SVM 
models, which could better handle the more distant and 
complex relations. This was especially true on the SHARP 
corpus, in which the SVM model outperformed baseline 3 by 
0.166 F1 (0.740 vs 0.574). 

Across different corpora, the results are consistently better 
when the evaluation is conducted on the ShARe corpus. The 
difference is probably due to the fact that the ShARe project 
annotated fewer entity types than SHARP, making the task of 
discovering body site and severity modifiers simpler. But we also 
found that when evaluating on the ShARe test data, a model 
trained on the SHARP data performs almost as well as a model 
trained on the ShARe data, indicating that the SHARP model is 
fairly portable to other domains. 

Our feature ablation experiments indicate that most features 
contribute to the overall system performance. Across both rela- 
tion types, the most important feature group is the named entity 
type features, followed by the token features, which is consistent 
with the findings in the general domain. 20 Unlike in the general 
domain, where chunking features appear to be among the 
largest contributors, in our experiments the chunking features 
did not improve the performance by much. Similarly, tree 
kernel features did not improve performance, contrary to 
several studies in the general domain. Finally, similar to the 
general domain, the dependency features provided only a 
modest boost to the system performance. 

To analyze the sources of errors, we manually reviewed 50 
LocationOf errors the system made on the SHARP data. Out of 
those 50, 22 instances were due to an error in the human anno- 
tations and 28 instances were actual system errors. It appears 
that the system errors could be attributed to one of three 
sources: 

1. Sentence segmentation errors (one instance) 

2. Infrequent patterns in training data (eight instances) 

3. Inability of the system to discern more complex semantic 
patterns (19 instances). 
An example of (2) is that the system mistakenly discovered 
LocationOf(abdominal aorta, aortogram) in Aortogram: Patent 
abdominal aorta that tapers from approximately 20 m... prob- 
ably due to the frequent appearance of a similar pattern in the 
data, for example, Lungs: Equal AE bilaterally, no rales, no 
rhonchi in which Lungs (anatomical site) appears in a similar 
position as Aortogram (procedure). An example of (3) is that 
the system erroneously identified LocationOf(feet, femoropopli- 
teal disease) in Non-invasive studies suggest significant femoro- 
popliteal disease with monophasic Doppler signals in the feet 
probably due to incorrectly attaching the PP with monophasic 
Doppler signals to femoropopliteal disease even though such 
attachment does not make sense semantically. 

CONCLUSION 

We presented a methodology for the discovery of two key attri- 
butes from the clinical narrative — body site and severity. We 
showed that the task can be successfully cast as a supervised 
machine learning relation extraction problem, and that key fea- 
tures include the surrounding tokens and UMLS named entities. 
The best-performing methods identify LocationOf relations 
with F1 of 0.740-0.908 and DegreeOf relations with F1 of 
0.905-0.929. These models are implemented as modules within 
cTAKES, thus providing an open source end-to-end system to 
the community for research and direct use purposes. In add- 
ition, the developed framework represents a general purpose 
utility for the semantic task of relation extraction thus contribut- 
ing to the clinical NLP ecosystem. 

This work focused on the discovery of body site and severity 
modifiers of clinical entities within the same sentence. 
Extending this work to inter-sentential relations will probably 
require leveraging sophisticated discourse processing including 
coreference resolution and in some cases textual entailment. 
Another challenge is relation discovery with underspecified, 
omitted or implicit information. For example, a mass men- 
tioned in a breast cancer pathology report without an explicit 
anatomical site implies that the location is highly likely to be 
the breast. 

The work described here is a step towards building a classifi- 
cation framework for relation discovery from the clinical narra- 
tive. Although in this work, we focused on DegreeOf and 
LocationOf relations, our system is easily extendable to many 
other relation types. In fact, to include new relations, no soft- 
ware changes are required; it is sufficient simply to include the 
examples of new relation types in the training data. The SHARP 
corpus currently includes several other UMLS relation types 
such as manages/treats and causes/brings_about. We are planning 
to retrain our system to include these relations in the near 
future. Our next steps will also include the implementation of 
the best methods in translational science applications such as 
phenotyping for the electronic medical record and genomics, 
informatics for i2b2, automatic disease activity classification 43 as 
part of the pharmacogenomics research network, and clinical 
question answering as part of the multi-source integrated plat- 
form for answering clinical questions. 44 